POST-SELECTION AND POST-REGULARIZATION INFERENCE IN 
LINEAR MODELS WITH MANY CONTROLS AND INSTRUMENTS 

VICTOR CHERNOZHUKOV, CHRISTIAN HANSEN, AND MARTIN SPINDLER 


Abstract. In this note, we offer an approach to estimating structural 
parameters in the presence of many instruments and controls based on 
methods for estimating sparse high-dimensional models. We use these 
high-dimensional methods to select both which instruments and which 
control variables to use. The approach we take extends Belloni et al. 
(2012), which covers selection of instruments for IV models with a small 
number of controls, and extends Belloni, Chernozhukov and Hansen (2014), 
which covers selection of controls in models where the variable of interest 
is exogenous conditional on observables, to accommodate both a large 
number of controls and a large number of instruments. We illustrate the 
approach with a simulation and an empirical example. Technical supporting 
material is available in a supplementary appendix. 

Online Appendix: Post-Selection and Post-Regularization Inference: An 
Elementary, General Approach. 


1. Model and Estimation Approach 
Consider the linear IV model 

$$y_i = \alpha_0 d_i + x_i'\beta_0 + \varepsilon_i, \qquad (1)$$

$$d_i = x_i'\gamma_0 + z_i'\delta_0 + u_i, \qquad (2)$$

with $E[(z_i', x_i')'\varepsilon_i] = E[(z_i', x_i')'u_i] = 0$. Here $d_i$ is the scalar endogenous 
variable and $\alpha_0$ the coefficient of interest, $x_i$ is a $p_n^x$-vector of exogenous control 
variables, $z_i$ is a $p_n^z$-vector of instruments, $n$ is the sample size, and $p_n^x \gg n$ and 
$p_n^z \gg n$ are allowed. Extension to the case where $d_i$ is a vector is straightforward and 
omitted for simplicity. We may have that $z_i$ and $x_i$ are correlated so that $z_i$ are only 
valid instruments after controlling for $x_i$; specifically, we let $z_i = \Pi x_i + \zeta_i$, for 
$\Pi$ a $p_n^z \times p_n^x$ matrix and $\zeta_i$ a $p_n^z$-vector of unobservables with $E[x_i\zeta_i'] = 0$. 
Substituting this expression for $z_i$ as a function of $x_i$ into (2) and then further 
substituting into (1) gives a system for $y_i$ and $d_i$ that depends only on $x_i$: 

$$y_i = x_i'\theta_0 + \rho_i^y, \qquad (3)$$

$$d_i = x_i'\vartheta_0 + \rho_i^d, \qquad (4)$$

with $E[x_i\rho_i^y] = 0$ and $E[x_i\rho_i^d] = 0$. This model includes the many instruments and 
small number of controls case by setting $p_n^x \ll n$ and can accommodate the exogenous 
case by setting $p_n^z = 0$ and imposing the additional condition $E[d_i\varepsilon_i] = 0$. 
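
The substitution that leads to (3) and (4) can be written out explicitly. The display 
below is a sketch implied by (1), (2), and $z_i = \Pi x_i + \zeta_i$; the closed forms for 
$\theta_0$, $\vartheta_0$, and the composite errors are derived here rather than quoted from the 
original text. 

```latex
\begin{align*}
d_i &= x_i'\gamma_0 + (\Pi x_i + \zeta_i)'\delta_0 + u_i
     = x_i'(\gamma_0 + \Pi'\delta_0) + \zeta_i'\delta_0 + u_i, \\
\vartheta_0 &= \gamma_0 + \Pi'\delta_0, \qquad
\rho_i^d = \zeta_i'\delta_0 + u_i, \\
y_i &= \alpha_0\,(x_i'\vartheta_0 + \rho_i^d) + x_i'\beta_0 + \varepsilon_i
     = x_i'(\alpha_0\vartheta_0 + \beta_0) + \alpha_0\rho_i^d + \varepsilon_i, \\
\theta_0 &= \alpha_0\vartheta_0 + \beta_0, \qquad
\rho_i^y = \alpha_0\rho_i^d + \varepsilon_i .
\end{align*}
```

Under these expressions, $E[x_i\rho_i^d] = E[x_i\zeta_i']\delta_0 + E[x_i u_i] = 0$, and 
$E[x_i\rho_i^y] = 0$ follows in the same way from $E[x_i\varepsilon_i] = 0$. 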

Because the dimension of $\eta_0 = (\theta_0', \vartheta_0', \gamma_0', \delta_0')'$ may be larger than $n$, informative 
estimation and inference about $\alpha_0$ is impossible without imposing restrictions on $\eta_0$. 
For simplicity, we provide discussion under the assumption of exact sparsity and present 
a generalization to approximate sparsity in the supplemental material. Specifically, we 
assume that 

$$\|\eta_0\|_0 \le s_n, \qquad s_n^2 \log(p_n^z + p_n^x)^3 / n \to 0,$$

where $\|\eta_0\|_0$ denotes the number of non-zero elements of $\eta_0$. That is, sparsity requires 
that, among the $p_n^x + p_n^z$ observed variables, the number of variables with non-zero 
coefficients is small relative to the sample size. This assumption then reduces the problem 
of estimating $\alpha_0$ to a problem of finding which instruments and controls to use in 
equations (1) and (2). 

The problem that arises is that variable selection techniques are not perfect and are 
prone to making selection mistakes. There are two kinds of selection mistakes: A variable 
may be deemed relevant when in fact it has a zero coefficient and thus has no true 
explanatory power, or a variable may be dropped from the model despite having a 
non-zero coefficient. Both types of mistakes may detrimentally affect post-model-selection 
estimators and inference for $\alpha_0$. When irrelevant variables are spuriously included after 
being deemed predictive from looking at the data, overfitting occurs and, importantly, 
the spuriously included variables are those most correlated to the noise in the sample 
due to data-snooping, which introduces a type of “endogeneity” bias. When relevant $x$ 
variables are excluded, one is left with standard omitted variables bias. When relevant 
$z$ variables are excluded, one loses identification power. This last concern can be dealt 
with through appropriate use of weak identification robust inference as in 
Belloni et al. (2012). 


The first type of mistake, the spurious inclusion of irrelevant variables, can be avoided 
through the use of modern, principled data-mining methods. For example, we use the 
Lasso with tuning parameters chosen as in Belloni et al. (2012), and many other options 
are available. These methods differ from the unprincipled data-snooping that many 
economists associate with the term data-mining. Specifically, modern data-mining 
denotes a principled search for true predictive power that guards against false discovery and 
overfitting, does not erroneously equate in-sample fit to out-of-sample predictive ability, 
and accurately accounts for using the same data to examine many different hypotheses 
or models. 


Of course, guarding against the first type of error comes at the cost of needing to 
acknowledge that the exclusion of relevant variables is likely to occur. While sensible 
approaches such as Lasso will accurately find strong predictors, one can show that such 
procedures have non-negligible probability of missing predictors with small but 
non-zero coefficients. Exclusion of such predictors can have substantive impacts on inference 
for parameters of interest such as $\alpha_0$ in our model; see, for example, Leeb and Pötscher 
(2008). To overcome this difficulty, one needs to base estimation and inference on 
procedures that are robust to this type of model selection mistake. One such approach relies 
on using estimating equations that are locally insensitive to this type of mistake, termed 
orthogonal moment functions in Belloni et al. (2013). 


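In the IV model with many instruments and controls, such a moment condition is 
given by 

$$M(\alpha_0, \eta_0) = 0, \qquad M(\alpha, \eta) := E[\psi_i(\alpha, \eta)], \qquad (5)$$

where $\psi_i(\alpha, \eta) = (\tilde\rho_i^y - \tilde\rho_i^d\alpha)\tilde v_i$ for $\eta := (\theta', \vartheta', \gamma', \delta')'$, $\tilde\rho_i^y := y_i - x_i'\theta$, 
$\tilde\rho_i^d := d_i - x_i'\vartheta$, and $\tilde v_i := x_i'\gamma + z_i'\delta - x_i'\vartheta$. When we set $\eta = \eta_0$, we have 
$\tilde\rho_i^y = \rho_i^y = y_i - x_i'\theta_0$, $\tilde\rho_i^d = \rho_i^d = d_i - x_i'\vartheta_0$, and 
$\tilde v_i = v_i := x_i'\gamma_0 + z_i'\delta_0 - x_i'\vartheta_0 = \zeta_i'\delta_0$. 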


We can see that small selection errors will have relatively little impact on estimation 
of $\alpha_0$ by noting that the following orthogonality condition holds: 

$$\frac{\partial}{\partial\eta} M(\alpha_0, \eta)\Big|_{\eta = \eta_0} = 0. \qquad (6)$$
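
Condition (6) can be checked block by block. The following is a sketch of that check, 
assuming differentiation and expectation can be interchanged and using 
$\rho_i^y - \alpha_0\rho_i^d = \varepsilon_i$ and $v_i = \zeta_i'\delta_0$, both of which follow from the model 
above; it is offered as a worked verification, not as the formal argument of the 
supplementary material. 

```latex
\begin{align*}
\partial_\theta M(\alpha_0,\eta)\big|_{\eta=\eta_0}
  &= -E[x_i v_i] = -E[x_i\zeta_i']\,\delta_0 = 0, \\
\partial_\gamma M(\alpha_0,\eta)\big|_{\eta=\eta_0}
  &= E[x_i(\rho_i^y - \alpha_0\rho_i^d)] = E[x_i\varepsilon_i] = 0, \\
\partial_\delta M(\alpha_0,\eta)\big|_{\eta=\eta_0}
  &= E[z_i(\rho_i^y - \alpha_0\rho_i^d)] = E[z_i\varepsilon_i] = 0, \\
\partial_\vartheta M(\alpha_0,\eta)\big|_{\eta=\eta_0}
  &= \alpha_0 E[x_i v_i] - E[x_i\varepsilon_i] = 0 .
\end{align*}
```

Each block vanishes because of either instrument validity, $E[(z_i', x_i')'\varepsilon_i] = 0$, or 
the condition $E[x_i\zeta_i'] = 0$. 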


In other words, missing the true value $\eta_0$ by a small amount does not invalidate the 
moment condition. Thus, estimators $\hat\alpha$ of $\alpha_0$ based on the empirical analog of (5), 

$$\hat M(\hat\alpha, \hat\eta) = 0 \qquad (7)$$

with $\hat M(\alpha, \eta) := n^{-1}\sum_{i=1}^n \psi_i(\alpha, \eta)$, can be shown to be “immunized” against small 
selection mistakes. See Belloni et al. (2013) for a general formulation of orthogonal 
moment functions for use in sparse high-dimensional models and a number of estimation 
and inference results. 


Note that operationally using the empirical version of (5) to estimate $\alpha_0$ is 
equivalent to using the usual IV regression of $\hat\rho^y$ on $\hat\rho^d$ using $\hat v$ as instruments. Based on this 
argument, we suggest the following algorithm for estimating $\alpha_0$ based on the 
“double-selection” strategy of Belloni, Chernozhukov and Hansen (2014). 


Algorithm 1. 
(1) Do Lasso or Post-Lasso Regression of $d_i$ on $x_i$, $z_i$ to obtain $\hat\gamma$ and $\hat\delta$. 
(2) Do Lasso or Post-Lasso Regression of $y_i$ on $x_i$ to get $\hat\theta$. 
(3) Do Lasso or Post-Lasso Regression of $\hat d_i = x_i'\hat\gamma + z_i'\hat\delta$ on $x_i$ to get $\hat\vartheta$. 
(4) Let $\hat\rho_i^y := y_i - x_i'\hat\theta$, $\hat\rho_i^d := d_i - x_i'\hat\vartheta$, and $\hat v_i := x_i'\hat\gamma + z_i'\hat\delta - x_i'\hat\vartheta$. Get 
estimator $\hat\alpha$ from (7) by using standard IV regression of $\hat\rho_i^y$ on $\hat\rho_i^d$ with $\hat v_i$ as the 
instrument. Perform inference on $\alpha_0$ using $\hat\alpha$ or the associated score statistic and 
conventional heteroscedasticity robust standard errors. 


The following result summarizes the properties of $\hat\alpha$ obtained from Algorithm 1. 


Proposition 1. Under the stated sparsity and other regularity conditions, the estimator 
$\hat\alpha$ defined in Algorithm 1 satisfies $\sqrt{n}(\hat\alpha - \alpha_0) \rightsquigarrow N(0, V)$ where 
$V = E[v_i^2]^{-2} E[\psi_i^2(\alpha_0, \eta_0)]$. The score statistic 
$C(\alpha_0) = n|\hat M(\alpha_0, \hat\eta)|^2 / (n^{-1}\sum_{i=1}^n \psi_i^2(\alpha_0, \hat\eta))$ satisfies $C(\alpha_0) \rightsquigarrow \chi^2(1)$. 
Confidence intervals based on these two results are uniformly valid for inference about 
$\alpha_0$ over a large class of models. 
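
The two results in Proposition 1 translate directly into a standard error and a score 
test. The sketch below assumes the residualized quantities from step (4) of Algorithm 1 
are already available as arrays; the function name `iv_inference` is hypothetical. It uses 
the sample mean of $\hat v_i\hat\rho_i^d$ in place of $E[v_i^2]$, which is the conventional robust IV 
formula and has the same limit under the model since $E[v_i\rho_i^d] = E[v_i^2]$. 

```python
import numpy as np

def iv_inference(rho_y, rho_d, v, alpha_null):
    """IV estimate, robust standard error, and score statistic at alpha_null."""
    rho_y, rho_d, v = (np.asarray(a, dtype=float) for a in (rho_y, rho_d, v))
    n = len(v)
    alpha_hat = (v @ rho_y) / (v @ rho_d)
    # Estimated scores psi_i(alpha_hat, eta_hat) and plug-in variance V_hat.
    psi = (rho_y - rho_d * alpha_hat) * v
    V_hat = np.mean(psi ** 2) / np.mean(v * rho_d) ** 2
    se = np.sqrt(V_hat / n)
    # Score statistic C(alpha_null) = n * Mhat^2 / mean(psi_i(alpha_null)^2).
    psi0 = (rho_y - rho_d * alpha_null) * v
    score = n * np.mean(psi0) ** 2 / np.mean(psi0 ** 2)
    return alpha_hat, se, score
```

A 5% level test based on the score statistic rejects when `score` exceeds 3.84, the 
95th percentile of the $\chi^2(1)$ distribution, and a confidence region can be formed by 
collecting the values of `alpha_null` that are not rejected. 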


The supplementary material provides a precise statement and proof. The theoretical 
results do not depend on whether the Lasso estimator or the Post-Lasso estimator of 
Belloni and Chernozhukov (2013) is used. In the results reported in this paper, we use 
the Post-Lasso estimator. Note that there are other algorithms that would yield similar 
asymptotic properties. For example, one could follow the double-selection strategy more 
closely by running Lasso regression of $d_i$ on $x_i$ and $z_i$, Lasso regression of $d_i$ on $x_i$, Lasso 
regression of $y_i$ on $x_i$, and then forming a 2SLS estimator using instruments selected in 
the first step and controlling for the union of controls selected in the three Lasso steps. 


2. Simulation Example 

To illustrate the preceding discussion, we report results from a small simulation 
experiment. Data were generated from the model given in Section 1 with $n = 200$ and with 
dimensions of 300 and 150 for the two sets of candidate variables (controls and 
instruments). Other parameter values were chosen so that the infeasible, optimal 
instruments are “strong”, perfect model selection is impossible, and the sparse model provides 
a good approximation. Further details are available in the supplementary material. 

We provide results for four different estimators: an infeasible Oracle estimator that 
knows the nuisance parameters $\eta$ (Oracle), two naive estimators, and the 
“Double-Selection” estimator. The first naive estimator follows Algorithm 1 but replaces 
Lasso/Post-Lasso with stepwise regression with p-value for entry of .05 and p-value for 
removal of .10 (Naive 1). It is well-known that this procedure fails to control model 
selection mistakes in which irrelevant variables are included. The second naive estimator 
estimates the high-dimensional nuisance functions using Post-Lasso but uses the moment 
condition $E[(\tilde\rho_i^y - \tilde\rho_i^d\alpha)(x_i'\gamma + z_i'\delta)] = 0$ (Naive 2). This moment condition does not satisfy 
the orthogonality condition described above, though estimation and inference about $\alpha_0$ 
using this condition will be valid when perfect model selection for the regression of $y$ on 
$x$ and $d$ on $x$ is possible. 

We report the median bias (Bias), median absolute deviation (MAD), and size of 5% 
level tests (Size) obtained from 1000 simulation replications for each procedure. For the 
Oracle, we have Bias of .006, MAD of .095, and Size of .043. For Naive 1, Bias, MAD, 
and Size are .160, .227, and .302 respectively; and Bias, MAD, and Size are respectively 
.035, .103, and .095 for Naive 2. Finally, the Double-Selection approach gives Bias of 
.021, MAD of .099, and Size of .054. 
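
The three summary measures can be computed from the replication output as in the 
following sketch. The function name `summarize` is hypothetical, and two conventions are 
assumed because the text does not spell them out: MAD is taken to be the median of 
$|\hat\alpha - \alpha_0|$ across replications, and Size is the rejection frequency of a two-sided 
t-test of the true value at the 5% level. 

```python
import statistics

def summarize(estimates, std_errors, alpha0, crit=1.96):
    """Median bias, median absolute deviation, and empirical size of a 5% test."""
    errors = [a - alpha0 for a in estimates]
    bias = statistics.median(errors)
    mad = statistics.median([abs(e) for e in errors])
    # A replication rejects when |alpha_hat - alpha0| / se exceeds the critical value.
    rejections = [abs(e) / se > crit for e, se in zip(errors, std_errors)]
    size = sum(rejections) / len(rejections)
    return bias, mad, size
```

A procedure with correct inference should produce a Size close to .05, which is the 
benchmark against which the reported values .043, .302, .095, and .054 are read. 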

These results correspond to the discussion in Section 1. The first naive, unprincipled 
procedure fails to control spurious inclusion of irrelevant variables and performs quite 
poorly relative to the other three approaches. The second naive procedure can be shown 
to be formally valid when perfect model selection is possible and performs relatively well 
in terms of MAD. However, the asymptotic approximation under perfect model selection 
provides a misleading approximation to the true sampling distribution as evidenced by 
the size distortion. Finally, we see that basing estimation and inference on a principled 
variable selection procedure and moment conditions that are immunized against small 
model selection mistakes produces an estimator that performs well relative to the 
infeasible Oracle in terms of both estimation and inference performance as measured by 
MAD and Size. 


3. Empirical Example 


We conclude with a brief empirical example where we estimate the coefficients in 
a simple model of demand for automobiles. We use the data and basic strategy of 
Berry et al. (1995). For simplicity, we consider the most basic specification 

$$\log(s_{it}) - \log(s_{0t}) = \alpha_0 p_{it} + x_{it}'\beta_0 + \varepsilon_{it},$$
$$p_{it} = z_{it}'\delta_0 + x_{it}'\gamma_0 + u_{it},$$

where $s_{it}$ is the market share of product $i$ in market $t$ with product 0 denoting the 
outside option, $p_{it}$ is price and treated as endogenous, $x_{it}$ are observed included product 
characteristics, and $z_{it}$ are instruments. One could also consider allowing random 
coefficients and adapting the variable selection procedures to this setting; see 
Gillen et al. (2014). 



In their basic results, Berry et al. (1995) use five variables in $x_{it}$: a constant, an air 
conditioning dummy, horsepower divided by weight, miles per dollar, and vehicle size. 
They argue that characteristics of other products provide valid instruments for price 
and choose 10 instruments for $p_{it}$ based on intuition and an exchangeability argument. 
The first five instruments are formed by deleting product $i$ and then summing each 
characteristic in $x$ across all remaining products produced by product $i$’s firm. The other 
five instruments are similarly constructed by deleting all products from product $i$’s firm 
and then summing each characteristic in $x$ across all remaining products. Using these 
controls and instruments, the 2SLS estimate of $\alpha_0$ is -.142 with an estimated standard 
error of .012. One might compare this to the OLS estimate obtained treating price as 
exogenous given the five controls listed above, which is -.089 with an estimated standard 
error of .004. 
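
The construction of the two groups of instruments can be sketched as follows. The 
function name `blp_instruments` and the argument layout are hypothetical, and the sums 
are assumed to be taken within each market, which the description above leaves 
implicit. 

```python
import numpy as np

def blp_instruments(X, firm, market):
    """Own-firm and rival-firm characteristic sums for each product.

    X: (n_products, k) characteristics; firm, market: length-n_products labels.
    Returns an (n_products, 2k) array: [own-firm sums, rival-firm sums].
    """
    X = np.asarray(X, dtype=float)
    firm = np.asarray(firm)
    market = np.asarray(market)
    own = np.zeros_like(X)
    rival = np.zeros_like(X)
    for t in np.unique(market):
        in_t = market == t
        market_total = X[in_t].sum(axis=0)
        for f in np.unique(firm[in_t]):
            idx = in_t & (firm == f)
            firm_total = X[idx].sum(axis=0)
            own[idx] = firm_total - X[idx]          # delete product i, sum over its firm
            rival[idx] = market_total - firm_total  # delete the whole firm, sum the rest
    return np.hstack([own, rival])
```

With five characteristics in `X`, the function returns the 10 instruments described 
above; the same routine applied to an augmented set of characteristics yields a larger 
set of candidate instruments. 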
It is interesting to note that Berry et al. (1995) state, “The choice of which attributes 
to include in the utility function is, of course, ad hoc” (p. 872). They similarly 
note that one could have considered additional instruments such as higher order terms 
(Berry et al. 1995, p. 861). The high-dimensional methods outlined in this paper offer 
one strategy to help address these concerns which complements the well-founded 
economic intuition motivating the authors’ choices. We apply our outlined methods in two 
scenarios. In the first, we apply the method using just the original five controls and 
10 instruments. In the second, we augment the set of potential controls with a time 
trend, quadratics, and cubics in all continuous variables, and all first order interactions 
and then use sums of these characteristics as potential instruments following the original 
strategy. These additions give a total of 24 x-variables and 48 potential instruments. 
We include the intercept in all models and select over the remaining variables. 


In both cases, the results suggest demand is more elastic than indicated in the baseline 
results. After selection using only the original variables, we estimate the price coefficient 
to be -.185 with an estimated standard error of .014. In this case, all five controls are 
selected in the log-share on controls regression, all five controls but only four instruments 
are selected in the price on controls and instruments regression, and four of the controls 
are selected for the price on controls relationship. The difference from the baseline 
results is thus largely driven by the difference in instrument sets. The change in the 
estimated coefficient is consistent with the wisdom from the many-instrument literature 
that inclusion of irrelevant instruments biases 2SLS toward OLS. 


With the larger set of variables, our post-model-selection estimator of the price 
coefficient is -.221 with an estimated standard error of .015. Here, we see some evidence that the 
original set of controls may have been overly parsimonious. In the log-share on controls 
regression, we have that eight control variables are selected; and we have seven controls 
and only four instruments selected in the price on controls and instrument regression. 
We also have that 13 variables are selected for the price on controls relationship. The 
selection of these additional variables suggests that there is important nonlinearity missed 
by the baseline set of variables. 

Finally, we note that in terms of own-price elasticities, the results become more 
plausible as we move from the baseline results to the results based on variable selection with 
a large number of controls. Recall that facing inelastic demand is inconsistent with 
profit maximizing price choice within the present context, so theory would predict that 
demand should be elastic for all products. However, the baseline point estimates imply 
inelastic demand for 670 products. Using the variable selection results provides results 
closer to the theoretical prediction. The point estimates based on selection from only 
the baseline variables imply inelastic demand for 139 products, and we estimate inelastic 
demand for only 12 products using the results based on selection from the larger set of 
variables. Thus, the new methods provide the most reasonable estimates of own-price 
elasticities. Of course, the simple specification above suffers from the usual drawbacks of 
the logit demand model, but the example illustrates how the application of the methods 
outlined in this note may be used in estimation of structural parameters in economics 
and add to the plausibility of the resulting estimates. 
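
The count of products with inelastic demand can be reproduced from a price coefficient 
as in the sketch below. It relies on the standard own-price elasticity of the simple logit 
model, $\alpha p_{it}(1 - s_{it})$, which is a textbook result rather than a formula stated in 
this note; the function names are hypothetical, and the prices and shares in the usage 
example are made up for illustration and are not the automobile data. 

```python
def own_price_elasticities(alpha, prices, shares):
    """Logit own-price elasticities: alpha * p * (1 - s) for each product."""
    return [alpha * p * (1.0 - s) for p, s in zip(prices, shares)]

def count_inelastic(alpha, prices, shares):
    """Number of products whose own-price elasticity is below one in absolute value."""
    return sum(1 for e in own_price_elasticities(alpha, prices, shares) if abs(e) < 1.0)

# Hypothetical prices and market shares, NOT taken from Berry et al. (1995).
prices = [5.0, 10.0, 20.0]
shares = [0.01, 0.02, 0.001]
baseline = count_inelastic(-0.142, prices, shares)   # baseline 2SLS coefficient
selected = count_inelastic(-0.221, prices, shares)   # coefficient after selection
```

Because the elasticity is proportional to the price coefficient, a more negative 
estimate mechanically moves low-priced products out of the inelastic region, which is 
the pattern described above. 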














4. Conclusion 


A great deal of empirical economic research aiming to estimate causal or structural 
effects depends on using the right set of controls and instruments. The need for 
formal methods that perform this model selection and inference procedures that remain 
valid following model selection is likely to increase in importance as data sets become 
richer. We have outlined one simple approach that can be used in an instrumental 
variables model with many instruments and controls that extends Belloni et al. (2012) and 
Belloni, Chernozhukov and Hansen (2014). The approach relies on an approximate 
sparsity assumption and the use of high-quality variable selection procedures coupled with 
the use of appropriate moment functions. These ideas follow from the general framework 
considered in Belloni et al. (2013). For more applications of similar ideas in economics, 
see also Bai and Ng (2009); Belloni et al. (ArXiv, 2010); Gautier and Tsybakov (2011); 
Belloni et al. (2010); and Belloni, Chernozhukov, Hansen and Kozbur (2014) and 
references therein. 